Categorizing Document Images into Script and Language Classes
Abstract
To archive and index large numbers of international documents properly, several challenging processing steps must be completed before optical character recognition (OCR) can be applied. We present a system that preclassifies documents for further processing and OCR. The system operates in four phases: preprocessing (includ[...] We present a set of statistical techniques, based fundamentally on connected component analysis and horizontal projections, for the first two phases. Even with little training, the system predicts the correct script category in 91% of cases when tested on real-life documents of varying kinds, formats, and qualities from many sources. The third and fourth phases are based on expert-system approaches. Language identification combines several heuristics based on a statistical analysis of our training corpus. It currently has a 95% success rate on real-life documents of moderate quality. We will discuss the techniques, their combination, and the process of improving performance.
Similar resources
Skew Detection, Page Segmentation, and Script Classification of Printed Document Images
Automatic processing of international documents presents a number of challenging problems because Optical Character Recognition (OCR) techniques are not available for all languages and all script classes. Document images must be categorized according to their script type first, in our case Roman, Ideographic, or Arabic. We present a set of statistical methods that first detect and correct the skew ...
Determination of the Script and Language Content of Document Images
Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly availab...
Script and Language Identification for Document Images and Scene Texts
In recent times, there has been an increase in Optical Character Recognition (OCR) solutions for recognizing text from scanned document images and scene texts taken with mobile devices. Many of these solutions work very well for an individual script or language. But in a multilingual environment such as India, where a document image or scene image may contain more than one language, th...
Neural network based system for script identification in Indian documents
The paper describes a neural network-based script identification system which can be used in the machine reading of documents written in English, Hindi and Kannada language scripts. Script identification is a basic requirement in automation of document processing, in multi-script, multi-lingual environments. The system developed includes a feature extractor and a modular neural network. The fea...
Script Identification for Document Image Retrieval: A Survey
In recent years, many multimedia documents have been captured and stored thanks to advances in computer technology, and hence the demand for recognizing and retrieving such documents has increased tremendously. In such an environment, the large volume of data and the variety of scripts make manual identification unworkable. In such cases the ability to automatically determine the script, and further the...
Publication date: 1998